Red Wine Quality Exploration by Hai Xiao

Data Summary

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
##        fixed.acidity     volatile.acidity          citric.acid 
##          0.209275508          0.339243549          0.718888086 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.555350954          0.538094924          0.658910770 
## total.sulfur.dioxide              density                   pH 
##          0.707916662          0.001893494          0.046626755 
##            sulphates              alcohol              quality 
##          0.257551132          0.102242090          0.143287121
## citric.acid 
##   0.7188881
##     density 
## 0.001893494
##         pH 
## 0.04662676

The task is to analyze which chemical properties (at least 8 independent variables) influence the quality of red wines. The dataset has 12 variables, which I read as:

  • quality - as the dependent or response variable

  • all others - the independent variables

The coefficients of variation (CV) above show that ‘density’ and ‘pH’ vary very little (CV below 0.2% and 4.7% respectively) compared to the other features in the dataset. I therefore leave ‘density’ and ‘pH’ out of the independent chemical variables in the Univariate Plots and Analysis sections, because:

  • their CVs are too small to contribute to a practical prediction, and such small variation is easily swamped by measurement error
  • they are physical properties (response variables) determined by the other 9 chemical variables
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "sulphates"            "alcohol"             
## [10] "quality"
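
The coefficient of variation used above is the standard deviation divided by the mean. A minimal sketch of that screening step follows. The data frame `toy` and the 5% cut-off `cv_threshold` are assumptions made for illustration, not the real wine data or a threshold stated in this report.

```r
# Toy stand-in for the wine data; values are illustrative only.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.8),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9962),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.39)
)

# Coefficient of variation: standard deviation relative to the mean.
cv <- sapply(toy, function(x) sd(x) / mean(x))

# Keep only columns whose CV exceeds the assumed 5% threshold.
cv_threshold <- 0.05
kept <- names(cv)[cv > cv_threshold]

print(round(cv, 4))
print(kept)
```

Because the CV is unitless, it allows features on very different scales (e.g. density near 1 and total.sulfur.dioxide in the tens) to be compared directly.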

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
The distribution of ‘quality’ is close to normal, with most scores at 5 or 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
The distribution of ‘alcohol’ is roughly bell-shaped but skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
##   96.37% 
## 0.999926
96.37% of samples have ‘sulphates’ below 1.0 (the 96.37th percentile is 0.999926).
The ‘sulphates’ distribution is roughly bell-shaped, with a long tail to the right.
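
The percentile figures quoted in this section pair a cut-off value with the share of samples below it. A small sketch of both directions follows; the vector `x` is made up for illustration and is not the wine data.

```r
# Hypothetical right-skewed sample.
x <- c(0.33, 0.47, 0.56, 0.58, 0.62, 0.65, 0.68, 0.80, 1.36, 2.00)

# Share of samples strictly below a chosen cut-off.
share_below_1 <- mean(x < 1.0)

# Value below which a chosen share of the samples falls.
cutoff_80 <- quantile(x, probs = 0.80)

print(share_below_1)
print(cutoff_80)
```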
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

## 99.5% 
##   151

The distribution of ‘total.sulfur.dioxide’ is roughly bell-shaped but skewed to the right.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

## 99.5% 
## 53.01

The distribution of ‘free.sulfur.dioxide’ is roughly bell-shaped but skewed to the right.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

##   99.5% 
## 0.41401

The ‘chlorides’ distribution is close to normal (this is more obvious after cutting off the long tail to the right).
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

##  99.5% 
## 11.019

The ‘residual.sugar’ distribution is close to normal after cutting off its long tail to the right.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## 99.5% 
##  0.74

The ‘citric.acid’ distribution is not normal at all; it is closer to bimodal, with two peaks at 0.00 and 0.49.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

##   99.5% 
## 1.09025

## 
## 0.36 0.58 0.59 0.43  0.5  0.6 
##   38   38   39   43   46   47
The ‘volatile.acidity’ distribution looks nearly bimodal but is not quite: the most frequent values give three peaks, at 0.43, 0.50 and 0.60.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The ‘fixed.acidity’ distribution is close to a skewed normal distribution.

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  10 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There are 1599 red wine samples (observations) in the dataset with 9 chemical features:

  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide
  • sulphates
  • alcohol

and two physical measurements that depend on them:

  • density
  • pH

and one main investigation feature (human rated red wine quality scores):

  • quality

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. I’d like to explore how the different chemical features contribute to good or bad wine quality, and also how they contribute to the wine’s physical properties, e.g. density and pH (kept in dataframe ‘red’, dropped from ‘reds’).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

density and pH are themselves mostly determined by the composition of the other 9 chemical features.

Because of this, even though they cannot themselves be the cause of wine quality, they may still correlate well with it within the limited variation that the wine-making process allows.

Hopefully they can be paired with other features to expose notable correlations with wine quality in the analysis.

Did you create any new variables from existing variables in the dataset?

Not so far. Instead I dropped the variable ‘X’ that came with the csv import, since I’m not interested in analyzing the sample sequence (assuming the order is random).

## [1] "integer"
## [1] 3 8

Since quality is an ordinal integer variable (min. 3, max. 8), I will later convert it to an additional categorical variable called ‘qua’.

This will be very helpful for visualizing how other variables change with quality, whether through color, faceted histograms or boxplots.
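
A minimal sketch of that conversion is given here. The `quality` vector is made up, and making the factor ordered with explicit levels 3 to 8 is an assumption about how ‘qua’ would be built, not something the report confirms.

```r
# Hypothetical quality scores; the real column has 1599 entries.
quality <- c(5L, 5L, 6L, 7L, 3L, 8L, 6L, 5L)

# An ordered factor keeps the ranking 3 < 4 < ... < 8.
# Listing the levels explicitly keeps level 4 even though it is unused here.
qua <- factor(quality, levels = 3:8, ordered = TRUE)

print(levels(qua))
print(table(qua))
```

Fixing the levels up front keeps legends and facet panels in the same order across plots, even for subsets where some score does not occur.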

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,]           4.6             0.12           0            0.9     0.012
## [2,]          15.9             1.58           1           15.5     0.611
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## [1,]                   1                    6 0.99007 2.74      0.33
## [2,]                  72                  289 1.00369 4.01      2.00
##      alcohol quality
## [1,]     8.4       3
## [2,]    14.9       8

The red wine data set is already tidy, and the range check above shows no abnormal values (e.g. negatives).

I tend to believe that all features fall within an acceptable error range, though I cannot confirm this since I don’t have repeated measurements of any single sample.

In the Univariate Plots above I did observe long right tails in a few features; I cut them off (a negligible percentage of samples) to better visualize the bulk of each distribution.
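
The tail cut can be sketched as below, using the 99.5% quantile that appears in the outputs of the Univariate Plots section. The vector `x` is simulated, and filtering the data (rather than only limiting the plot axis) is an assumption about how the cut was done.

```r
set.seed(1)
# Hypothetical feature: a right-skewed bulk plus five extreme values.
x <- c(rexp(995, rate = 1), rep(50, 5))

# Upper cut-off at the 99.5th percentile.
upper <- quantile(x, probs = 0.995)

# Keep the bulk for plotting; the raw vector x is left untouched.
trimmed <- x[x <= upper]

# Bin the trimmed values; set plot = TRUE to draw the histogram.
h <- hist(trimmed, breaks = 30, plot = FALSE)
print(length(x) - length(trimmed))
```

The trimmed values are only meant for display; summaries and correlations elsewhere in the report are still computed on all samples.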


Bivariate Plots Section

From this section on, we go back to dataframe ‘red’ (including density and pH).
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
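
A matrix of this shape is what `cor()` returns for a data frame of numeric columns. The sketch below uses simulated columns with an assumed negative link between acidity and pH; the names and coefficients are illustrative, not taken from the wine data.

```r
set.seed(42)
n <- 200
fixed_acidity <- rnorm(n, mean = 8.3, sd = 1.7)
# Hypothetical dependence: pH falls as acidity rises, plus noise.
pH <- 3.9 - 0.07 * fixed_acidity + rnorm(n, sd = 0.1)
toy <- data.frame(fixed_acidity, pH, noise = rnorm(n))

# Pearson correlation for every pair of numeric columns.
r <- cor(toy, method = "pearson")
print(round(r, 3))
```

The result is symmetric with ones on the diagonal, which is why each coefficient appears twice in the printed matrix.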

Observations:

  • fixed.acidity is strongly correlated with citric.acid (0.67170343), density (0.66804729) and pH (-0.68297819), and weakly with volatile.acidity (-0.256130895). A plausible explanation:
    • fixed.acidity is the acidity dissolved in water, and citric.acid is water-soluble
    • fixed.acidity reflects the acids added to the water, and those acid molecules normally weigh more than water molecules
    • acids lower pH: the more acid, the lower the pH
    • fixed.acidity and volatile.acidity are complementary
  • volatile.acidity is strongly correlated with citric.acid (-0.55249568), and more weakly with quality (-0.39055778), sulphates (-0.260986685) and pH (0.23493729). A plausible explanation:
    • volatile.acidity does not come from added citric.acid, which is water-soluble rather than volatile
    • at high levels, volatile.acidity can cause an unpleasant, vinegar taste
  • citric.acid is strongly correlated with pH (-0.54190414), and moderately with density (0.36494718), sulphates (0.312770044) and quality (0.22637251). A plausible explanation:
    • citric.acid is an acid, so it lowers pH, and its molecules are heavier than water’s
    • citric.acid can boost wine quality, as it can add ‘freshness’ and flavor to wines.
  • residual.sugar is moderately correlated with density (0.35528337), but not with quality (0.01373164). A plausible explanation:
    • residual.sugar reflects the sugar remaining in the wine, and sugar molecules are heavier than water’s
  • chlorides is moderately correlated with sulphates (0.371260481), and weakly with pH (-0.26502613), alcohol (-0.22114054) and density (0.20063233); it may go with lower quality (-0.12890656). A possible explanation:
    • chlorides reflects the amount of salt in the wine. Salt is heavier than water, and in this data higher chlorides goes with lower pH.
    • wines with more chlorides tend to have less alcohol, which may in turn lower quality somewhat; whether salt actually affects the alcohol level cannot be told from the correlation alone.
  • free.sulfur.dioxide is strongly correlated with total.sulfur.dioxide (0.66766645). This is explained as:
    • free.sulfur.dioxide is the part of total.sulfur.dioxide that is in the free form of SO2, while total.sulfur.dioxide also contains the bound forms of SO2.
  • total.sulfur.dioxide is weakly correlated with alcohol (-0.20565394) and quality (-0.18510029). A possible explanation:
    • wines with higher SO2 levels tend to have less alcohol and slightly lower quality; one guess is that SO2 reacts with alcohol, but the correlation alone cannot confirm this.
  • sulphates is moderately correlated with chlorides (0.371260481), citric.acid (0.31277004) and quality (0.25139708). A possible explanation:
    • besides being added directly, sulphates may also be produced with help from the citric.acid and chlorides additives
    • sulphates acts as an antimicrobial and antioxidant to keep wine fresh, so it may help wine quality as well.
    • Initially it is hard to tell why sulphates correlates only weakly with both free.sulfur.dioxide and total.sulfur.dioxide. But a closer read of the introductory material shows that sulphates is potassium sulphate (K2SO4), which is not easily converted from SO2.
  • alcohol has the strongest correlation with quality (0.47616632) and is also correlated with density (-0.49617977). This is explained as:
    • alcohol is a main ingredient of wine, as of any liquor, and it contributes positively to wine quality.
    • alcohol is lighter than water, so the higher the alcohol level, the lower the density.
  • additionally, density is correlated with alcohol (-0.49617977), and weakly with quality (-0.17491923). This is explained as:
    • alcohol is lighter than water, so the higher the density, the lower the alcohol level and, possibly, the wine quality.
  • additionally, pH is weakly correlated with alcohol (0.20563251) and sulphates (-0.196647602), and hardly at all with quality (-0.05773139). This is explained as:
    • pH itself is just a measurement. It may be raised by alcohol and lowered by sulphates, but both alcohol and sulphates go with higher wine quality, so the two effects offset and the correlation of pH with quality is minimal.

Plots of the correlation matrix, for better visualization (colored by quality):

Red Wine feature correlations

Plots like the one above are very helpful, among other things, in the pre-processing stage of a classification problem, where you want to analyze your predictors given the class labels. They give a pairwise comparison of multivariate data.
##      fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## [1,]     0.1240516       -0.3905578   0.2263725     0.01373164 -0.1289066
##      free.sulfur.dioxide total.sulfur.dioxide    density          pH
## [1,]         -0.05065606           -0.1851003 -0.1749192 -0.05773139
##      sulphates   alcohol
## [1,] 0.2513971 0.4761663
##        alcohol volatile.acidity sulphates citric.acid total.sulfur.dioxide
## [1,] 0.4761663       -0.3905578 0.2513971   0.2263725           -0.1851003
##         density  chlorides fixed.acidity          pH free.sulfur.dioxide
## [1,] -0.1749192 -0.1289066     0.1240516 -0.05773139         -0.05065606
##      residual.sugar
## [1,]     0.01373164
##        alcohol volatile.acidity sulphates citric.acid total.sulfur.dioxide
## [1,] 0.4761663       -0.3905578 0.2513971   0.2263725           -0.1851003
##       chlorides fixed.acidity
## [1,] -0.1289066     0.1240516
Plots of the correlations most relevant to wine quality:

Red Wine key feature correlations
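
The ranking and filtering behind the three outputs above can be sketched as follows. The coefficients are copied from the first output; the |r| > 0.1 cut-off and the exclusion of density and pH are assumptions inferred from which features remain in the last output.

```r
# Correlations with quality, copied from the output above.
r_quality <- c(
  fixed.acidity = 0.1240516, volatile.acidity = -0.3905578,
  citric.acid = 0.2263725, residual.sugar = 0.01373164,
  chlorides = -0.1289066, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.1851003, density = -0.1749192,
  pH = -0.05773139, sulphates = 0.2513971, alcohol = 0.4761663
)

# Rank by absolute strength, strongest first.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

# Keep chemical features above the assumed |r| > 0.1 cut-off.
chemical <- ranked[!names(ranked) %in% c("density", "pH")]
key <- chemical[abs(chemical) > 0.1]

print(names(ranked))
print(names(key))
```

Sorting on the absolute value treats strong negative correlations (e.g. volatile.acidity) as just as informative as strong positive ones.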

Next let’s look at bivariate plots of quality against each of the major, notably correlated variables.

Quality vs. Alcohol

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Here we see a clear trend: quality increases with alcohol, except for the middle bucket of quality 5, which has the lowest alcohol level of all (both mean, 9.9%, and median).

In this picture, the linear fit (lm) of alcohol vs. quality makes the trend clear.
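
The per-quality summaries and the linear fit can be sketched as below. The data are simulated with an assumed positive link between alcohol and quality, and calling `lm()` directly is an assumption; the original plot may have drawn the fit through a plotting layer instead.

```r
set.seed(7)
n <- 300
alcohol <- runif(n, min = 8.5, max = 14)
# Hypothetical relation: quality rises with alcohol, plus noise, clipped to 3..8.
quality <- pmin(8, pmax(3, round(1.5 + 0.4 * alcohol + rnorm(n, sd = 0.7))))
toy <- data.frame(alcohol, quality)

# Per-quality summaries, in the same shape as the output above.
print(by(toy$alcohol, toy$quality, summary))

# Straight-line fit; the sign of the slope gives the direction of the trend.
fit <- lm(quality ~ alcohol, data = toy)
print(coef(fit))
```

The same two steps apply to every feature in this section by swapping the column name.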

Quality vs. volatile.acidity

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Here we see a clear trend: quality increases as volatile acidity decreases.

In this picture, the linear fit (lm) of volatile acidity vs. quality shows a clear trend.

Quality vs. sulphates

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

Here we see a clear trend: quality increases as sulphates increases.

In this picture, the linear fit (lm) of sulphates vs. quality shows that higher sulphates may contribute to higher quality, but the prediction error also grows with sulphates (the band around the smooth widens to the right).

Quality vs. Citric Acid

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Here we see a clear trend: quality increases as citric acid increases.

In this picture, the linear fit (lm) of citric acid vs. quality shows that higher citric acid may contribute to slightly higher quality, but the prediction error also grows with citric acid (the band around the smooth widens to the right).

Quality vs. total SO2

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

Here we see that higher quality wine has a slightly lower total SO2 level. The total SO2 level peaks at quality level 5.

In this picture, the linear fit (lm) of total SO2 vs. quality shows that a higher total SO2 level may contribute to slightly lower quality, but with growing prediction error (the band around the smooth widens to the right).

This increasing prediction error can be depicted by a high order polynomial regression below. It may be due to a few outlier samples at a high quality level (7) with high total SO2.
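
A polynomial regression of this kind can be sketched with `poly()` inside `lm()`. The data below are simulated with an assumed curved relation, and degree 3 is an assumption, since the report only says "high order".

```r
set.seed(3)
n <- 200
so2 <- runif(n, min = 6, max = 160)
# Hypothetical curved relation, for illustration only.
quality <- 5 + 0.02 * so2 - 0.00015 * so2^2 + rnorm(n, sd = 0.5)

linear <- lm(quality ~ so2)
cubic  <- lm(quality ~ poly(so2, 3))  # orthogonal polynomial of degree 3

# The linear model is nested in the cubic one, so the cubic fit
# cannot have a larger residual sum of squares.
print(c(linear = deviance(linear), cubic = deviance(cubic)))
print(anova(linear, cubic))
```

A higher degree follows the data more closely but also reacts more strongly to a handful of extreme samples, which is worth keeping in mind when outliers sit at the edge of the range.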

Quality vs. Chlorides (Sodium)

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Here we see that higher quality wine has a slightly lower chlorides level, but the chlorides level does not change much across the quality levels.

In this picture, the linear fit (lm) of chlorides vs. quality shows that a higher chlorides level may contribute to slightly lower quality, with fast-growing prediction error (the band around the smooth widens quickly to the right).

This increasing prediction error can be depicted by a high order polynomial regression below.

Quality vs. Fixed Acidity

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.230  12.600

Here we see that higher quality wine has slightly higher fixed acidity, although the fixed acidity level peaks at quality level 7.

In this picture, the linear fit (lm) of fixed acidity vs. quality shows that a higher fixed acidity level may contribute to slightly higher quality, with growing prediction error (the band around the smooth widens to the right).

We have not specifically analyzed ‘density’ and ‘pH’ against other features in this bivariate plots section, even though we have their correlations with the other features:
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,]     0.6680473       0.02202623   0.3649472      0.3552834 0.2006323
##      free.sulfur.dioxide total.sulfur.dioxide density         pH sulphates
## [1,]         -0.02194583           0.07126948       1 -0.3416993 0.1485064
##         alcohol    quality
## [1,] -0.4961798 -0.1749192
##      fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## [1,]    -0.6829782        0.2349373  -0.5419041    -0.08565242 -0.2650261
##      free.sulfur.dioxide total.sulfur.dioxide    density pH  sulphates
## [1,]           0.0703775          -0.06649456 -0.3416993  1 -0.1966476
##        alcohol     quality
## [1,] 0.2056325 -0.05773139
We chose not to analyze ‘density’ and ‘pH’ against the other features in the bivariate analysis because:
  • I consider ‘density’ and ‘pH’ both non-independent variables; in general, the fact that they are determined by the composition of the others should hold true.
  • One exception to keep in mind: the water (base quality) differs a little between wineries (e.g. its pH is not always exactly ‘7’ as for pure water). I’ll ignore this variation for now, given that the overall correlation between ‘pH’ and ‘quality’ is still low (-0.05773139).
  • A clearer demonstration of the relation between ‘density’, other features and ‘quality’ will be given in the Multivariate analysis.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Broadly, I grouped the 12 features into 3 classes:

  • 9 independent chemical variables:
    • fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol
  • 2 dependent variables (measurements determined by the 9 above):
    • density and pH
  • 1 response variable (the target feature, rated by human judges):
    • quality (factored to qua)

Observations:

  • certain strong correlations exist among the 9 chemical features, notably:

    ## [1] 0.6676665
    ##      fixed.acidity volatile.acidity
    ## [1,]     0.6717034       -0.5524957
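    • this is relatively easy to understand, as:
      • free SO2 and total SO2 are highly correlated (0.6676665), since free SO2 is part of total SO2 and both may come from a common additive
      • citric acid is a water-soluble, non-volatile acid, so it contributes to the fixed acidity of the wine (0.6717034)
      • volatile acidity measures the acids that evaporate readily (mainly acetic acid), the counterpart of fixed acidity in wine; it is negatively correlated with citric acid (-0.5524957)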
  • density and pH have strong correlations to certain chemical features, notably:

    ##      fixed.acidity    alcohol citric.acid residual.sugar
    ## [1,]     0.6680473 -0.4961798   0.3649472      0.3552834
    ##      fixed.acidity citric.acid  chlorides volatile.acidity   alcohol
    ## [1,]    -0.6829782  -0.5419041 -0.2650261        0.2349373 0.2056325
    • this is relatively easy to understand, as:
      • heavier molecules from those components increase density
      • the acid or base properties of those components shift the overall pH of the wine (neutral value = 7), with acids bringing the pH down
  • quality has notable correlations with 7 chemical features:

    ##        alcohol volatile.acidity sulphates citric.acid total.sulfur.dioxide
    ## [1,] 0.4761663       -0.3905578 0.2513971   0.2263725           -0.1851003
    ##       chlorides fixed.acidity
    ## [1,] -0.1289066     0.1240516
  • quality has very weak correlations with the 2 other chemical features:

    ##      free.sulfur.dioxide residual.sugar
    ## [1,]         -0.05065606     0.01373164
The feature of interest is ‘quality’. Notably, it relates to the other chemical features as follows:
  • quality increases as the following increase:
    • alcohol: as a main ingredient of wine (or any liquor), alcohol is positively associated with wine quality
    • sulphates: sulphates act as an antimicrobial and antioxidant that keeps wine fresh, so they may help wine quality
    • citric.acid: citric acid can add ‘freshness’ and flavor to wines, which may boost quality
  • quality decreases as the following increase:
    • volatile.acidity: at high levels it can cause an unpleasant, vinegar-like taste, contributing to lower quality
    • total.sulfur.dioxide: the correlation is weak (-0.1851003); one possible explanation is that high SO2 levels become noticeable in the smell and taste of the wine
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Other interesting relationships include:

  • density increases with the amount of heavier (molecular weight vs. water) chemical additives or residues in the wine, including:
    • fixed.acidity
    • citric.acid
    • residual.sugar
  • density decreases as lighter (vs. water) components in the wine increase, e.g.:
    • alcohol: alcohol is lighter than water, so a higher density tends to mean a lower alcohol level and, possibly, lower wine quality
  • pH decreases as acidity in the wine increases, since acids lower the pH value. These include:
    • fixed.acidity
    • citric.acid
  • alcohol is somewhat positively correlated with pH (0.2056325); ethanol is close to neutral, so more alcohol does not add acidity.

  • pH itself is only weakly related to wine quality:
    • pH is raised by alcohol and lowered by sulphates, fixed.acidity and citric.acid, yet all four are associated with higher wine quality, so the effects largely cancel and the net correlation of pH with quality is minimal
What was the strongest relationship you found?
  • the list of strong relationships:
    • fixed.acidity <=> citric.acid

      ## [1] 0.6717034
    • fixed.acidity <=> density

      ## [1] 0.6680473
    • fixed.acidity <=> pH

      ## [1] -0.6829782
    • volatile.acidity <=> citric.acid

      ## [1] -0.5524957
    • volatile.acidity <=> quality

      ## [1] -0.3905578
    • citric.acid <=> pH

      ## [1] -0.5419041
    • free.sulfur.dioxide <=> total.sulfur.dioxide

      ## [1] 0.6676665
    • alcohol <=> density

      ## [1] -0.4961798
    • alcohol <=> quality

      ## [1] 0.4761663
  • the strongest relationship is:
    • fixed.acidity <=> pH

      ## [1] -0.6829782
  • the strongest relationship to wine quality is:
    • alcohol <=> quality

      ## [1] 0.4761663
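
The list above can also be produced mechanically by ranking every pair in the correlation matrix by absolute value. A rough sketch under stated assumptions: the data frame `toy` and its columns `a`, `b`, `c` are synthetic placeholders, not the wine features.

```r
# Synthetic example data (NOT the wine samples).
set.seed(2)
n <- 300
toy <- data.frame(a = rnorm(n))
toy$b <- -0.9 * toy$a + rnorm(n, 0, 0.3)   # strongly (negatively) tied to a
toy$c <- rnorm(n)                           # unrelated noise

cm <- cor(toy)
# Keep each pair once and drop the self-correlations on the diagonal.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair.
pair_tbl <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pair_tbl) <- c("var1", "var2", "r")
pair_tbl <- pair_tbl[!is.na(pair_tbl$r), ]

# Rank by strength regardless of sign; the first row is the strongest pair.
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
print(pair_tbl)
strongest <- pair_tbl[1, ]
```

Ranking by `abs(r)` matters here because the strongest relationship in the report (fixed.acidity vs. pH) is a negative one.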

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Multivariate (density and scatter) plots above tell us that higher quality wine normally comes with:
  • higher alcohol level
  • lower volatile.acidity level
  • higher sulphates level
  • higher citric.acid level
Earlier bivariate plots also told us that higher quality wine might come with:
  • lower total.sulfur.dioxide
  • lower chlorides
  • higher fixed.acidity
A higher citric.acid level likely also goes with higher fixed.acidity and sulphates levels.
Were there any interesting or surprising interactions between features?
As an observed variable, density is raised by most of the other chemical additives or residues; the exception is the major ingredient, alcohol, which is lighter than water.
As another observed variable, pH is lowered by most of the other chemical additives or residues; the main exceptions are alcohol and volatile.acidity, which both correlate positively with pH.
I consider both density and pH to be dependent observation variables determined by the other chemical features, so I ignore these two in the modelling below.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
Based on the previous findings and the relationships between quality and the 7 major variables:
##        alcohol volatile.acidity sulphates citric.acid total.sulfur.dioxide
## [1,] 0.4761663       -0.3905578 0.2513971   0.2263725           -0.1851003
##       chlorides fixed.acidity
## [1,] -0.1289066     0.1240516
We can build a linear model that uses those 7 chemical variables (the other 2 have a negligible correlation with quality) to predict the quality of a red wine.
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = red)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = red)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide, data = red)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + chlorides, data = red)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + chlorides + fixed.acidity, 
##     data = red)
## 
## =====================================================================================================
##                            m1         m2         m3         m4         m5         m6         m7      
## -----------------------------------------------------------------------------------------------------
##   (Intercept)            1.875***   3.095***   2.611***   2.646***   2.843***   2.985***   2.652***  
##                         (0.175)    (0.184)    (0.196)    (0.201)    (0.205)    (0.206)    (0.240)    
##   alcohol                0.361***   0.314***   0.309***   0.309***   0.295***   0.276***   0.288***  
##                         (0.017)    (0.016)    (0.016)    (0.016)    (0.016)    (0.017)    (0.017)    
##   volatile.acidity                 -1.384***  -1.221***  -1.265***  -1.222***  -1.104***  -1.173***  
##                                    (0.095)    (0.097)    (0.113)    (0.112)    (0.115)    (0.118)    
##   sulphates                                    0.679***   0.696***   0.721***   0.908***   0.888***  
##                                               (0.101)    (0.103)    (0.103)    (0.111)    (0.111)    
##   citric.acid                                            -0.079     -0.043      0.065     -0.203     
##                                                          (0.104)    (0.104)    (0.106)    (0.145)    
##   total.sulfur.dioxide                                              -0.002***  -0.002***  -0.002***  
##                                                                     (0.001)    (0.001)    (0.001)    
##   chlorides                                                                    -1.763***  -1.576***  
##                                                                                (0.403)    (0.408)    
##   fixed.acidity                                                                            0.037**   
##                                                                                           (0.014)    
## -----------------------------------------------------------------------------------------------------
##   R-squared                  0.2        0.3        0.3        0.3        0.3        0.4        0.4   
##   adj. R-squared             0.2        0.3        0.3        0.3        0.3        0.3        0.4   
##   sigma                      0.7        0.7        0.7        0.7        0.7        0.7        0.7   
##   F                        468.3      370.4      268.9      201.8      167.0      143.9      124.9   
##   p                          0.0        0.0        0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood         -1721.1    -1621.8    -1599.4    -1599.1    -1589.7    -1580.2    -1576.5   
##   Deviance                 805.9      711.8      692.1      691.9      683.8      675.7      672.6   
##   AIC                     3448.1     3251.6     3208.8     3210.2     3193.5     3176.4     3171.1   
##   BIC                     3464.2     3273.1     3235.7     3242.4     3231.1     3219.4     3219.5   
##   N                       1599       1599       1599       1599       1599       1599       1599     
## =====================================================================================================
  • The variables in this linear model account for about 35% of the variance in the quality of red wines (R-squared = 0.3546 for m7; the table above rounds it to 0.4).

  • Strength of this model:
    • It is simple and fast to fit
  • Limitation of this model:
    • The R-squared value (the goodness of fit) is not high (about 0.35)
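
Because the comparison table rounds R-squared to one decimal, it helps to read the unrounded values directly from each fit. The sketch below shows one way to build nested models with `update()` and tabulate their fit statistics. It is an illustration only: the data frame `toy` is synthetic, its coefficients are made up, and only three of the seven predictors are included.

```r
# Synthetic data with made-up coefficients (NOT the wine samples).
set.seed(3)
n <- 500
toy <- data.frame(alcohol          = rnorm(n, 10.4, 1.1),
                  volatile.acidity = rnorm(n, 0.53, 0.18),
                  sulphates        = rnorm(n, 0.66, 0.17))
toy$quality <- 2.5 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  0.7 * toy$sulphates + rnorm(n, 0, 0.65)

# Grow the model one predictor at a time, as in m1 ... m7 above.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + sulphates)

# Collect unrounded fit statistics for each model.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_table <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(round(fit_table, 4))
```

For nested least-squares models, R-squared can only stay the same or rise as predictors are added, so adjusted R-squared or AIC is the more useful column when deciding whether an extra variable earns its place.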

Final Plots and Summary

Plot One

Accordingly, the per-quality alcohol statistics are:
## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
Description One

I chose this plot for the final section to check the hypothesis that a higher alcohol level goes with better wine quality.

  • The diagram shows a clear trend: quality increases as the typical alcohol level increases, with one exception at quality 5, which has the lowest median (9.7%) and mean (9.9%) alcohol level
  • From the statistics table, quality 5 holds the highest Max. alcohol level at 14.9%, but that value is an outlier
  • Quality 3 holds the lowest Min. alcohol level at 8.40% (tied with quality 6), and for quality 3 that value is also an outlier
  • It also shows that, except for high quality (7 and 8) wines, the mean alcohol level is above the median
  • For high quality wines (7 and 8), the median alcohol level is slightly above the mean

I think this result does make sense, as wines with ~12% alcohol level are just right for the taste :)
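
The per-quality table has the layout that `by(..., summary)` prints, and the outlier remarks can be checked with the usual boxplot rule (points beyond 1.5 × IQR from the quartiles). A minimal sketch, assuming synthetic data: the data frame `toy` and the helper `is_outlier` are hypothetical and not taken from the original analysis.

```r
# Synthetic data (NOT the wine samples): three quality levels, skewed alcohol.
set.seed(4)
toy <- data.frame(quality = rep(c(5L, 6L, 7L), each = 50))
toy$alcohol <- 9 + 0.6 * (toy$quality - 5) + rexp(150, rate = 1.5)

# Per-group summary statistics (min, quartiles, median, mean, max).
per_quality <- by(toy$alcohol, toy$quality, summary)
print(per_quality)

# TRUE for values outside the 1.5 * IQR boxplot fences.
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule within each quality group; ave() returns 0/1, so compare to 1.
toy$outlier <- ave(toy$alcohol, toy$quality, FUN = is_outlier) == 1
print(table(quality = toy$quality, outlier = toy$outlier))
```

Applied to the table above, the same rule gives a lower fence of roughly 9.725 - 1.5 × (10.58 - 9.725) ≈ 8.44 for quality 3, so its minimum of 8.40 falls just outside it.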

Plot Two
Another question to ask is: can the average ‘quality’ per ‘alcohol’ level also be found?

Presumably, wines at higher ‘alcohol’ levels should have a higher average (or median) ‘quality’.

But ‘alcohol’ is not a categorical variable, so the question is rephrased as plotting the density distribution of ‘alcohol’ for each quality level.

Description Two
  • This shows that the alcohol levels of higher quality wines extend further toward higher values.
  • This also supports the hypothesis that at a higher alcohol level (a vertical line in the plot), the average quality (the color-weighted integral along that line) is higher.
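
An alternative way to answer the "average quality per alcohol level" question numerically is to bin the continuous variable and average within each bin. This is a sketch of that swapped-in technique, not what the plot above does; the data frame `toy`, the bin width of 2, and the simulated relationship are all assumptions for illustration.

```r
# Synthetic data (NOT the wine samples): quality loosely rises with alcohol.
set.seed(5)
n <- 400
toy <- data.frame(alcohol = runif(n, 8.5, 14))
raw_score <- 2 + 0.35 * toy$alcohol + rnorm(n, 0, 0.7)
toy$quality <- pmin(8L, pmax(3L, as.integer(round(raw_score))))

# Turn the continuous alcohol level into categories, then average per bin.
toy$alcohol.bin <- cut(toy$alcohol, breaks = seq(8, 14, by = 2),
                       include.lowest = TRUE)
mean_quality <- tapply(toy$quality, toy$alcohol.bin, mean)
print(round(mean_quality, 2))
```

The choice of bin width trades resolution against noise: narrow bins follow the trend more closely but hold fewer samples each.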
Plot Three
Since I consider density an observed variable and have not used it in the linear modeling, I want to find some visual evidence that it is strongly correlated with other variables, which in turn contribute to wine quality.

Since ‘alcohol’ is the feature most correlated with wine quality (0.4761663) and is also notably correlated with density (-0.4961798), it is a natural choice to plot density vs. alcohol by quality.

Description Three
  • This clearly indicates that samples with more alcohol normally have lower density.
  • Most higher quality samples come with both higher alcohol and lower density.
I can also support this view by showing that adding ‘density’ to the prior linear model does not notably improve its goodness of fit:

Previously we had the summary of m7 as:

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + chlorides + fixed.acidity, 
##     data = red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72028 -0.37289 -0.06422  0.45556  2.01920 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.6516821  0.2402263  11.038  < 2e-16 ***
## alcohol               0.2876488  0.0170070  16.914  < 2e-16 ***
## volatile.acidity     -1.1733907  0.1177349  -9.966  < 2e-16 ***
## sulphates             0.8877424  0.1108192   8.011 2.18e-15 ***
## citric.acid          -0.2030352  0.1452195  -1.398 0.162270    
## total.sulfur.dioxide -0.0019662  0.0005278  -3.725 0.000202 ***
## chlorides            -1.5757580  0.4080225  -3.862 0.000117 ***
## fixed.acidity         0.0367162  0.0136220   2.695 0.007105 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6502 on 1591 degrees of freedom
## Multiple R-squared:  0.3546, Adjusted R-squared:  0.3518 
## F-statistic: 124.9 on 7 and 1591 DF,  p-value: < 2.2e-16

Then:

m8 <- update(m7, ~ . + density)

The summary of m8 is:

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + chlorides + fixed.acidity + 
##     density, data = red)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73073 -0.36653 -0.06734  0.45255  1.98186 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.816e+01  1.508e+01   1.867 0.062042 .  
## alcohol               2.680e-01  2.059e-02  13.019  < 2e-16 ***
## volatile.acidity     -1.137e+00  1.196e-01  -9.505  < 2e-16 ***
## sulphates             9.163e-01  1.120e-01   8.179  5.8e-16 ***
## citric.acid          -1.982e-01  1.452e-01  -1.365 0.172321    
## total.sulfur.dioxide -1.907e-03  5.287e-04  -3.606 0.000320 ***
## chlorides            -1.584e+00  4.078e-01  -3.883 0.000107 ***
## fixed.acidity         5.473e-02  1.729e-02   3.167 0.001572 ** 
## density              -2.558e+01  1.512e+01  -1.692 0.090896 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6498 on 1590 degrees of freedom
## Multiple R-squared:  0.3558, Adjusted R-squared:  0.3525 
## F-statistic: 109.8 on 8 and 1590 DF,  p-value: < 2.2e-16
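  • R-squared barely increases from m7 to m8 (0.3546 to 0.3558), and the density coefficient is not significant at the 5% level (p = 0.0909), so adding density does not notably improve the model.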
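
A more formal way to compare two nested models like these is an F-test via `anova()`, which asks whether the added term reduces the residual sum of squares by more than chance would. The sketch below uses synthetic data in which density depends on alcohol but adds nothing to quality on its own; the data frame `toy` and the model names `m_base` and `m_ext` are hypothetical.

```r
# Synthetic data (NOT the wine samples).
set.seed(6)
n <- 500
toy <- data.frame(alcohol = rnorm(n, 10.4, 1.1))
# Assumption for illustration: density is driven by alcohol plus noise,
# and quality depends on alcohol only.
toy$density <- 1.005 - 0.0008 * toy$alcohol + rnorm(n, 0, 0.0015)
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(n, 0, 0.7)

m_base <- lm(quality ~ alcohol, data = toy)
m_ext  <- update(m_base, ~ . + density)

# Change in R-squared from adding the extra predictor.
r2_gain <- summary(m_ext)$r.squared - summary(m_base)$r.squared
print(r2_gain)

# Nested-model F-test for the added term.
cmp <- anova(m_base, m_ext)
print(cmp)
```

With a single added predictor, the F-test p-value in the `anova()` table equals the t-test p-value for that coefficient in the extended model's summary.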

Reflection

The red wine data set contains information on 1599 samples. I started by understanding the individual variables and classifying them into three classes: 9 chemical variables, 2 observed measurement variables (density and pH), and 1 response target variable (quality). I then explored the relationships between quality and the other variables, and the major relationships among the other variables, via plots and statistical analysis. I ended up with a linear model that predicts wine quality from alcohol, volatile.acidity, sulphates, citric.acid, total.sulfur.dioxide, chlorides and fixed.acidity. The model accuracy isn’t high (R^2 of about 0.35), leaving room to improve.

The assumption that density and pH are purely determined by the other 9 chemical variables may not hold in the real world: water quality may differ between wineries or vineyards, which could lead to a different base pH (and perhaps a slightly different density) in the wine to start with. To that extent pH can sometimes be independent, but it is not treated as such in this analysis.

From the results, I think other key features that are not included in the data set contribute to wine quality. In short, I cannot imagine that the quality of red wine is determined to a high degree by the 11 features of this data set; in the real world, red wine quality may depend heavily on other variables as well, e.g. grape type, blend, flavor additives, etc.